
Monitoring Infrastructure

Overview

In this project, I worked on monitoring applications and infrastructure using AWS services. The ability to monitor applications and infrastructure is critical for delivering reliable, consistent IT services.

Monitoring requirements range from collecting statistics for long-term analysis to quickly reacting to changes and outages. Monitoring can also support compliance reporting by continuously checking that infrastructure is meeting organizational standards.

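I learned how to use several AWS monitoring tools: the CloudWatch agent, CloudWatch Logs, CloudWatch metrics and alarms, CloudWatch Events, and AWS Config.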

By the end, I successfully:

Task 1: Installing the CloudWatch Agent

I started by installing the CloudWatch agent on an EC2 instance. The CloudWatch agent is really versatile - it can collect metrics from both EC2 instances and on-premises servers, including:

Here's how I did it:

Action: Install
Name: AmazonCloudWatchAgent
Version: latest

The output included the message "Step execution skipped due to unsatisfied preconditions: '"StringEquals": [platformType, Windows]'. Step name: createDownloadFolder". This was expected: that step applies only to Windows platforms, and my instance was created from a Linux AMI, so I safely ignored it and selected Step 2 - Output instead.

Next, I needed to configure the CloudWatch agent to collect web server logs and system metrics. I stored this configuration in AWS Systems Manager Parameter Store:

Name: Monitor-Web-Server
Description: Collect web logs and system metrics
Value: I pasted a JSON configuration that defined:
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "log_group_name": "HttpAccessLog",
            "file_path": "/var/log/httpd/access_log",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%b %d %H:%M:%S"
          },
          {
            "log_group_name": "HttpErrorLog",
            "file_path": "/var/log/httpd/error_log",
            "log_stream_name": "{instance_id}",
            "timestamp_format": "%b %d %H:%M:%S"
          }
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_iowait", "cpu_usage_user", "cpu_usage_system"],
        "metrics_collection_interval": 10,
        "totalcpu": false
      },
      "disk": {
        "measurement": ["used_percent", "inodes_free"],
        "metrics_collection_interval": 10,
        "resources": ["*"]
      },
      "diskio": {
        "measurement": ["io_time"],
        "metrics_collection_interval": 10,
        "resources": ["*"]
      },
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 10
      },
      "swap": {
        "measurement": ["swap_used_percent"],
        "metrics_collection_interval": 10
      }
    }
  }
}

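I examined the configuration and found that it defines the following items to be monitored: two web server log files (the HTTP access log and the HTTP error log, each sent to its own log group), and CPU, disk, disk I/O, memory, and swap metrics, all collected every 10 seconds.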

I clicked Create parameter to store this parameter for reference when starting the CloudWatch agent.
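
As a quick sanity check before storing a configuration like this, the JSON can be parsed and summarized. The following is a minimal sketch using only the Python standard library. It works on an abbreviated copy of the configuration above, and the function name `summarize_agent_config` is my own, not part of any AWS tooling.

```python
import json

# Abbreviated copy of the agent configuration shown above
# (one log file and two metric groups, to keep the sketch short).
AGENT_CONFIG = """
{
  "logs": {"logs_collected": {"files": {"collect_list": [
    {"log_group_name": "HttpAccessLog",
     "file_path": "/var/log/httpd/access_log",
     "log_stream_name": "{instance_id}"}
  ]}}},
  "metrics": {"metrics_collected": {
    "cpu": {"measurement": ["cpu_usage_idle", "cpu_usage_user"],
            "metrics_collection_interval": 10},
    "mem": {"measurement": ["mem_used_percent"],
            "metrics_collection_interval": 10}
  }}
}
"""


def summarize_agent_config(text):
    """Return (log_files, metrics) described by a CloudWatch agent config.

    log_files maps each file path to its log group name;
    metrics maps each metric group to its list of measurements.
    """
    config = json.loads(text)
    collect_list = (
        config.get("logs", {})
        .get("logs_collected", {})
        .get("files", {})
        .get("collect_list", [])
    )
    log_files = {entry["file_path"]: entry["log_group_name"] for entry in collect_list}
    collected = config.get("metrics", {}).get("metrics_collected", {})
    metrics = {name: section.get("measurement", []) for name, section in collected.items()}
    return log_files, metrics


log_files, metrics = summarize_agent_config(AGENT_CONFIG)
for path, group in log_files.items():
    print(f"log file {path} -> log group {group}")
for name, measurements in metrics.items():
    print(f"metric group {name}: {', '.join(measurements)}")
```

Because `json.loads` raises an error on malformed input, running this also catches copy-and-paste mistakes before the parameter is created.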

After creating the parameter, I started the CloudWatch agent on the web server:

Action: configure
Mode: ec2
Optional Configuration Source: ssm
Optional Configuration Location: Monitor-Web-Server
Optional Restart: yes

At this point, the CloudWatch agent was running and sending log and metric data to CloudWatch.
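
The same install and configure steps could probably be scripted rather than clicked through. The sketch below builds the arguments for the Systems Manager `SendCommand` API. It assumes the two steps correspond to the `AWS-ConfigureAWSPackage` and `AmazonCloudWatch-ManageAgent` documents, which this write-up does not name, and the parameter keys are as I recall them from those documents, so they are worth checking in the console. The instance ID is a placeholder.

```python
def install_agent_request(instance_ids):
    """Arguments for SendCommand that install the CloudWatch agent package."""
    return {
        "DocumentName": "AWS-ConfigureAWSPackage",  # assumed document name
        "InstanceIds": list(instance_ids),
        "Parameters": {
            "action": ["Install"],
            "name": ["AmazonCloudWatchAgent"],
            "version": ["latest"],
        },
    }


def configure_agent_request(instance_ids, parameter_name="Monitor-Web-Server"):
    """Arguments for SendCommand that start the agent from a Parameter Store config."""
    return {
        "DocumentName": "AmazonCloudWatch-ManageAgent",  # assumed document name
        "InstanceIds": list(instance_ids),
        "Parameters": {
            "action": ["configure"],
            "mode": ["ec2"],
            "optionalConfigurationSource": ["ssm"],
            "optionalConfigurationLocation": [parameter_name],
            "optionalRestart": ["yes"],
        },
    }


def send(request):
    """Submit a request with boto3 (needs AWS credentials; not called here)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    return boto3.client("ssm").send_command(**request)


# Placeholder instance ID, for illustration only.
request = configure_agent_request(["i-0123456789abcdef0"])
print(request["DocumentName"], request["Parameters"])
```

Keeping the request builders separate from `send` makes it possible to inspect the exact parameters before anything is run against a real instance.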

Task 2: Monitoring Application Logs Using CloudWatch Logs

CloudWatch Logs lets me monitor applications and systems using log data. For example, CloudWatch Logs can track the number of errors that occur in application logs and send a notification whenever the rate of errors exceeds a threshold that I specify.

CloudWatch Logs uses existing log data for monitoring, so no code changes are required. For example, I can monitor application logs for specific literal terms (such as "NullReferenceException") or count the number of occurrences of a literal term at a particular position in log data (such as 404 status codes in a web server access log). When the term being searched for is found, CloudWatch Logs reports the data to a CloudWatch metric that I specify. Log data is encrypted while in transit and while it is at rest.

The Web Server generates two types of log data: access logs and error logs.

I generated some log data on the Web Server to monitor with CloudWatch Logs:

This demonstrated how log files can be automatically shipped from an EC2 instance or an on-premises server to CloudWatch Logs, making log data accessible without having to log in to each individual server. Log data can also be collected from multiple servers, such as an Auto Scaling fleet of web servers.

Creating a Metric Filter in CloudWatch Logs

I configured a filter to identify 404 errors in the log file. These errors normally indicate that the web server is serving invalid links that users are following:

Filter name: 404Errors
Metric namespace: LogMetrics
Metric name: 404Errors
Metric value: 1

This metric filter could now be used in an alarm.
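
To make the idea of matching a term at a particular position concrete, here is a small local simulation of what such a filter does. The filter pattern is not recorded above, so the one in the comment is an assumption written in the CloudWatch Logs space-delimited syntax. The sample log lines are invented, and the field splitting is my own approximation of how bracketed and quoted fields are kept together.

```python
import re

# Assumed filter pattern (not recorded in this write-up):
#   [ip, id, user, timestamp, request, status_code=404, size]

# A field is a [bracketed] run, a "quoted" run, or a run of non-space characters.
FIELD = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

# Invented access log lines, for illustration only.
SAMPLE_LINES = [
    '10.0.0.5 - - [12/Mar/2024:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 3520',
    '10.0.0.5 - - [12/Mar/2024:10:00:02 +0000] "GET /missing HTTP/1.1" 404 196',
    '10.0.0.9 - - [12/Mar/2024:10:00:07 +0000] "GET /start HTTP/1.1" 404 196',
]


def split_fields(line):
    """Split an access log line into its space-delimited fields."""
    return FIELD.findall(line)


def metric_values(lines, status="404"):
    """Emit a metric value of 1 for each line whose sixth field equals status."""
    values = []
    for line in lines:
        fields = split_fields(line)
        if len(fields) == 7 and fields[5] == status:
            values.append(1)
    return values


print(sum(metric_values(SAMPLE_LINES)))
```

Each matching line contributes the configured metric value of 1, so summing the values over a period gives the number of 404 responses in that period.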

Creating an Alarm Using the Filter

I configured an alarm to notify me when too many 404 errors occur:

Period: 1 minute
Conditions: Greater than or equal to 5
Alarm name: 404 Errors
Alarm description: Alert when too many 404s detected on an instance

To test the alarm, I:

This demonstrated how to create alarms from application log data and receive alerts for unusual behavior, with the log file accessible within CloudWatch Logs for further analysis.
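
The alarm settings above map fairly directly onto the CloudWatch `PutMetricAlarm` API. The sketch below shows one plausible set of arguments, plus a tiny local simulation of the threshold check. The `Sum` statistic and the single evaluation period are assumptions, since they are not recorded above, and the timestamps are invented.

```python
from collections import Counter

# Arguments that could be passed to boto3's cloudwatch.put_metric_alarm(**ALARM_REQUEST).
ALARM_REQUEST = {
    "AlarmName": "404 Errors",
    "AlarmDescription": "Alert when too many 404s detected on an instance",
    "Namespace": "LogMetrics",
    "MetricName": "404Errors",
    "Statistic": "Sum",        # assumption: statistic is not stated above
    "Period": 60,              # 1 minute
    "EvaluationPeriods": 1,    # assumption: not stated above
    "Threshold": 5,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
}


def alarm_states(event_seconds, period=60, threshold=5):
    """Map each period start (in seconds) to 'ALARM' or 'OK'.

    event_seconds lists the times at which a 404 was logged.
    Periods with no events are omitted from the result.
    """
    per_period = Counter((second // period) * period for second in event_seconds)
    return {
        start: ("ALARM" if count >= threshold else "OK")
        for start, count in sorted(per_period.items())
    }


# Invented timestamps: three 404s in the first minute, five in the second.
states = alarm_states([3, 10, 15, 70, 75, 80, 90, 100])
print(states)
```

The simulation only covers the comparison itself. The real service also tracks an INSUFFICIENT_DATA state and has configurable handling for missing data points, which this sketch ignores.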

Task 3: Monitoring Instance Metrics Using CloudWatch

Metrics are data about the performance of systems. CloudWatch stores metrics for the AWS services used, and I can also publish my own application metrics either via the CloudWatch agent or directly from applications. CloudWatch can present the metrics for search, graphs, dashboards, and alarms.

I examined EC2 metrics:

These metrics don't give insight into what's running inside the instance, such as measuring free memory or free disk space. Fortunately, the CloudWatch agent runs inside the instance to collect these internal metrics.

To view the CloudWatch agent metrics:

Task 4: Creating Real-Time Notifications

CloudWatch Events delivers a near-real-time stream of system events describing changes in AWS resources. Simple rules can match events and route them to target functions or streams. CloudWatch Events becomes aware of operational changes as they occur.

CloudWatch Events responds to these operational changes and takes corrective action where needed by sending messages, activating functions, making changes, and capturing state information. It can also schedule automated actions using cron or rate expressions.

I created a real-time notification for instance state changes:

Service Name: EC2
Event Type: EC2 Instance State-change Notification
Selected the checkbox for Specific state(s)
From the dropdown menu, selected stopped and terminated
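
Behind the console form, a rule like this is stored as a JSON event pattern. The sketch below shows the pattern I would expect these selections to produce, together with a deliberately simplified matcher that illustrates how such a pattern selects events. The matcher is my own approximation and covers only exact-value matching, and the sample event and instance ID are invented.

```python
import json

# Event pattern corresponding to the selections above.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}


def matches(pattern, event):
    """Simplified matcher: every pattern key must be present in the event,
    and a list in the pattern matches when the event value is one of its items."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Invented event with a placeholder instance ID.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample_event))
```

An event whose state is `running` would not match, which is why the rule stays quiet during normal instance starts.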

In the Targets section, I:

Configure a Real-Time Notification

I could configure Amazon Simple Notification Service (Amazon SNS) to send notifications to my phone via SMS or to my email. Because SMS messaging requires opening a ticket with AWS Support and takes time to set up, I used email instead.

More information about configuring SMS messaging with SNS is available in the Amazon Simple Notification Service Developer Guide.

By default the notification contains the raw event. To receive a more readable message, I could create an AWS Lambda function triggered by CloudWatch Events that formats the event and sends the result via Amazon SNS.
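
A rough sketch of what such a Lambda function might look like follows. I did not build this, so treat it as an illustration only. The `TOPIC_ARN` environment variable, the message wording, and the sample event are all hypothetical.

```python
import os


def format_message(event):
    """Turn an EC2 state-change event into a one-line, human-readable message."""
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id", "unknown instance")
    state = detail.get("state", "unknown state")
    region = event.get("region", "unknown region")
    when = event.get("time", "unknown time")
    return f"EC2 instance {instance_id} in {region} changed to '{state}' at {when}."


def lambda_handler(event, context):
    """Lambda entry point: format the event and publish it to an SNS topic."""
    import boto3  # provided by the Lambda Python runtime

    message = format_message(event)
    boto3.client("sns").publish(
        TopicArn=os.environ["TOPIC_ARN"],  # hypothetical environment variable
        Subject="EC2 state change",
        Message=message,
    )
    return {"message": message}


# Invented event, for local testing of the formatting only.
sample_event = {
    "region": "us-east-1",
    "time": "2024-03-12T10:00:00Z",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
print(format_message(sample_event))
```

Splitting the formatting from the handler means the message text can be tested locally without AWS credentials.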

This demonstrated how to receive real-time notifications when infrastructure changes.

Task 5: Monitoring for Infrastructure Compliance

With AWS Config, I can assess, audit, and evaluate the configurations of AWS resources. AWS Config continuously monitors and records AWS resource configurations and allows automated evaluation of recorded configurations against desired configurations.

AWS Config lets me review changes in configurations and relationships between AWS resources, dive into detailed resource configuration histories, and determine overall compliance against configurations specified in internal guidelines. It simplifies compliance auditing, security analysis, change management, and operational troubleshooting.

I set up AWS Config rules:

Get started
Next
Next
Confirm

This rule looks for resources without a project tag. The evaluation takes a few minutes to complete, so I continued with the next steps.

I then added a rule to check for unused EBS volumes:

The results showed:

I learned that AWS Config has a large library of pre-defined compliance checks, and custom checks can be created using Lambda.
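
To illustrate the kind of logic these two checks apply, here is a local sketch of both evaluations. It is not the AWS Config custom rule interface, just the core comparison a check might perform. I am assuming the two rules resemble the managed rules `required-tags` and `ec2-volume-inuse-check`. The resource records and IDs are invented.

```python
def missing_required_tag(resources, tag_key="project"):
    """Return IDs of resources that lack the given tag key (noncompliant)."""
    return [r["id"] for r in resources if tag_key not in r.get("tags", {})]


def unattached_volumes(volumes):
    """Return IDs of EBS volumes with no attachments (unused, noncompliant)."""
    return [v["id"] for v in volumes if not v.get("attachments")]


# Invented inventory, for illustration only.
resources = [
    {"id": "i-0aaa", "tags": {"project": "web"}},
    {"id": "i-0bbb", "tags": {"Name": "scratch"}},
    {"id": "vol-0ccc"},
]
volumes = [
    {"id": "vol-0ccc", "attachments": []},
    {"id": "vol-0ddd", "attachments": ["i-0aaa"]},
]

print("missing project tag:", missing_required_tag(resources))
print("unused volumes:", unattached_volumes(volumes))
```

Note that the tag comparison here is case-sensitive, so a resource tagged `Project` would still be reported as missing `project`.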

In conclusion, this work gave me hands-on experience with AWS monitoring tools that are essential for maintaining reliable systems. I learned how to collect and analyze logs, track metrics both from outside and inside instances, receive real-time notifications of infrastructure changes, and ensure compliance with organizational standards.
